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SYM-ILDL: Incomplete LDL^ Factorization of Symmetric Indefinite and 
Skew-Symmetric Matrices 

Chen Greif, Shiwen He, and Paul Liu, University of British Columbia, Vancouver, Canada 


SYM-ILDL is a numerical software package that computes incomplete LDL^ (or ‘ILDL’) factorizations 
of symmetric indefinite and real skew-symmetric matrices. The core of the algorithm is a Crout variant 
of incomplete LU (ILU), originally introduced and implemented for symmetric matrices by [Li and Saad, 
Crout versions of IL U factorization with pivoting for sparse symmetric matrices, Transactions on Numerical 
Analysis 20, pp. 75—85, 2005]. Our code is economical in terms of storage and it deals with real skew- 
symmetric matrices as well, in addition to symmetric ones. The package is written in C-|—h and it is 
templated, open source, and includes a Matlab"*"^^ interface. The code includes built-in ROM and AMD 
reordering, two equilibration strategies, threshold Bunch-Kaufman pivoting and rook pivoting, as well as a 
wrapper to MC64, a popular matching based equilibration and reordering algorithm. We also include two 
built-in iterative solvers: SQMR, preconditioned with ILDL, and MINRES, preconditioned with a symmetric 
positive definite preconditioner based on the ILDL factorization. 


1. INTRODUCTION 

For the numerical solution of symmetric and real skew-symmetric linear systems of the form 

Ax = b, 

stable (skew-)symmetry-preserving decompositions of A often have the form 

PAP^ = LDL^, 


where L is a (possibly dense) lower triangular matrix and U is a bloc k-diagonal matrix 
with 1-by-l and 2-by-2 blocks [Bunch 1982[ Bunch and Kaufman 1977] . The matrix P is 
a permutation matrix, satisfying PP^ = I, and the right-hand side vector b is permuted 
accordingly: in practice we solve {PAP^){Px) = Pb. 

In the context of incomplete LDL"'’ (ILDL) decompositions of sparse and large matrices for 
preconditioned iterative solvers, various element-dropping strategies are commonly used to 
impose sparsity of the factor, L. Fill-reducing reordering strategies are also used to encourage 
the sparsity of L, and various scaling methods are applied to improve conditioning. For 
a symmetric linear system, several methods have been developed. Approaches h ave been 
proposed which perturb or partition A so that i ncompl ete Cholesky may be used Lin and 
More 1999 [Orban 2014| Scott and Tuma 2014aj . While Lin and More 19 99| is designed tor 


positive definite matrices, the recent papers of Orban 2014 and Scott and Tuma 2014a 

are applicable to a large set of 2 x 2 block structured indehiiite systems. 

We present SYM-ILDL — a software package based on a left-looking Crout version of 
LU, which is stabilized by pivoting strategies such as Bunch-Kaufman and rook pivoting. 
The algorithmic principles underlying our s oftware are based o n (and extend) an incomplete 
LDL"*" factoriz ation approach proposed by 
Li et al. 2003| 


and Jones and Plassmann 


Li and Saad 2005 , which itself extends work by 


995 . We offer the following new contributions: 


— A Grout-based incomplete LDL"'" factorization for real skew-symmetric matrices is in¬ 
troduced in this paper for the first time. It features a similar mechanism to the one for 
symmetric indefinite matrices, but there are notable differences. Most importantly, for real 
skew-symmetric matrices the diagonal elements of D are zero and the pivots are always 
2x2 blocks. 

— We offer two integrated preconditioned solvers. The first solver is a preconditioned MIN¬ 
RES solver, specialized to our ILDL code. The main challenge here is to design a positive 
definite preconditioner, even though ILDL produces an indefinite (or skew-symmetric) 
factorization. To that end, for the symmetric case we implement the technique presented 
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Gill et al. 1992 


For the skew-symmetric case we introduce a positive definite precon- 
ditioner based on exploiting the simple 2x2 structure of the pivots. The second s olver 
is a preconditioned SQMR solver, based on the work of [Freund and Nachtigal 1994|. For 
SQMR, we use ILDL to precondition it directly. 

— The code is written in C-|—k, and is templated and easily extensible. As such, it can be 
easily modified to work in other fields of numbers, such as C. SYM-ILDL is self-contained 
and it includes implementations of reordering methods (AMD and ROM), equilibration 
methods (in the max-norm, 1-norm, and 2-norm), and pivoting methods (Bunch-Kaufman 
and rook pivoting). Additionally, we provide a wrapper that allows the user to use the 
popular HSL_MC64 library to reorder and equilibrate the matrix. To facilitate ease of use, a 
Matlab^“ MEX file is provided which offers the same performance as the C-I--I- version. 
The MEX file simply bundles the C-I--I- library with the Matlab"'"^ interface, that is, the 
main computations are still done in C-k-k. 


Incomplete factorizations of symmetric indefinite matrices have received much att ention 
recently and a few numerical packages have been developed in the past few years. [Scott 


and Tuma 2014a have developed a numerical software package based on signed incomplete 


Cholesky factorization preconditioners due to [Lin and More 1999[. For saddle-point sys- 
s, [Scott and Tuma 2014a hav e extended their limited memory incomplete Cholesky 
irithm Scott and Tuma 2014b to a signed inco mplet e Cholesky fact orization. Their 
approach builds on the ideas of [Tismenetsky 1991[ and [Kaporin 1998[. In the case of 

1_l.j_ ! _• -A. • J 'IV :_ J inr>r>h 


b reakdown (a zero pivot), a global shift is applied (see also Lin and More 19^). 


Scott and Tuma [2014a Section 6.4] have made comparisons with our code, and have 


found that in general, the two codes are comparable in performance for several of the test 
problems, whereas for some of the problems each code outperforms the other. However, the 
package they compared was an earlier release of SYM-ILDL. Given the numerous improve¬ 
ments made upon the code since then, we repeat their comprehensive comparisons and show 


th at SYM-ILDL now performs better than the preconditioner of Scott and Tuma 2014a . 

[Orban 2014 has developed LL DL, a generalization of the limited-memory Cholesky fac¬ 
torization of [Lin and More 1999[ to the symmetric indefinite case with special interest in 
symmetric quasi-definite matrices. The code generates a factorization of the form LDL"*" 
with D diagonal. We are currently engaged in a comparison of our code to LLDL. 

The remainder of this paper is structured as follows. In Sectionj^we outline a Crout-based 
factorization for symmetric and skew-symmetric matrices, symmetry-preserving pivoting 
strategies, equilibration approaches and reordering strategies. In Section[^we discuss how to 
modify the output of SYM-ILDL to produce a positive definite preconditioner for MINRES. 
In Section [4| we discuss the implementation of SYM-ILDL, and how the pivoting strategies 
of Section[^may be efficiently implemented within SYM-ILDL’s data structures. Einally, we 
compare SYM-ILDL with other software packages and show the performance of SYM-ILDL 
on some general (skew-)symmetric matrices and some saddle-point matrices in Section]^ 


2. LDL AND ILDL FACTORIZATIONS 

SYM-ILDL uses a Grout variant of LU factorization. To maintain stability, SYM-ILDL al¬ 
lows the user to choose one of two symmetry-preserving pivoting strategies: Bunch-Kaufman 
partial pivoting [Bunch and Kaufman 1977[ (Bunch in the skew-symmetric case [Bunch 
I982[) and rook pivoting. The details of the factorization and pivoting procedures, as well 
as simplifications for the skew-symmetric case, are provided in the following sections. See 
also [Duff 200^ for more details on the use of direct solvers for solving skew-symmetric 
matrices. 
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2.1. Crout-based factorizations 

The Grout order is an attractive way for computing an ILDL factorization of symmetric 
or skew-symmetric matrices, because it naturally preserves structural symmetry, especially 
when dropping rules for the incomplete factorization are applied. As opposed to the IKJ- 
based approach [Li and Saad 2005] , Grout relies on computing and applying dropping rules 
to a column of L and a row of u simultaneously. The Grout procedure for a symmetric 
matrix is outlined in Algorithm [^ using a delayed update procedure for the factors which 
is laid out in Algorithm]^ (As shown in Algorithm]^ the procedure in Algorithmmay be 
called multiple times when various pivoting procedures are employed.) 


ALGORITHM 1: fc-th column update procedure 

Input: A symmetric matrix A, partial factors L and D, matrix size n, current column index k 
Output: Updated factors L and D 

1 ^k:n,k ^ Ak:n,k 

2 t 1 

3 while i < k do 

4 Si ■<— size of the diagonal block with Di^i as its top left corner 

5 Lk:n,k ^ Lk:n,k Lk:n,i:i-^-Si — l iii+s ■ — 1 "^fc.iii + Si — 1 

6 i <— i + Si 

7 end 


ALGORITHM 2: Grout factorization, LDL"'’C 
Input: A symmetric matrix A 

Output: Matrices P, L, and D, such that PAP « LDLF 

1 fc 1 

2 L 0 

3 H ■<— 0 

4 while k < n do 

5 Call Algorithm to update L 

6 Find a pivoting matrix in Ak-n,k-.n and permute A and L accordingly 

7 s size of the pivoting matrix 

8 if s = 2 then 

9 Update the k -\- 1-th column of L 

10 end 

11 Pk:k + s — l,k:k + s — l ^ I-'k:k + s — l ,k:k + s — \ 

12 Lk:n,k-.k + s-l Lk:n,k:k + s-lD 

13 Apply dropping rules to Lk+s:n,k:k+s-i 

14 k k + s 

15 end 


For computing the ILDL factorization, we apply dropping rules; see line 10 of Algorithmj^ 
These are the standard rules: we drop all entries below a pre-specified tolerance (referred 
to as drop_tol throughout the paper), multiplied by the norm of a column of L, keeping 
up to a pre-specified maximum number of the largest nonzero entries in every column. We 
use here the term actor to signify the maximum allowed ratio between the number 

of nonzeros in any column of L and the average number of nonzeros per column of A. 

In Algorithmic the s x s pivot is typically I x 1 or 2 x 2, as per the strategy devised by 
Bunch and Kaufman 197^, which we briefly describe next. 
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2.2. Symmetric partial pivoting 

Pivoting in the symmetric or skew-symmetric setting is challenging, since we seek to preserve 
the (skew-)symmetry and it is not sufficient to use 1x1 pivots to maintain stability. Much 


work has been done in this front; see, for example, [Duff et al. 1989 Duff et al. 1991 
and Scott 201^and the references therein. 


Hogg 


Bunch and Kaufman 1977 proposed a partial pivoting strategy for symmetric matrices, 
which relies on hnding 1x1 and 2x2 pivots. The cost of finding a pivot is 0{n), as it only 
involves searching up to two columns. We provide this procedure in Algorithm For all 
pivoting algorithms below, we assume that we are pivoting on the Schur complement (i.e. 
column 1 is the A:-th column if we are on the fc-th step of Algorithm]^. 


ALGORITHM 3: Bunch-Kaufman LDL"’’ using partial pivoting strategy 

1 (l-H^)/8(« 0.64) 

2 cui •<— maximum magnitude of any subdiagonal entry in column 1 

3 if |aii| > auii then 

4 Use Oil as a 1 x 1 pivot (s = 1) 

5 else 

6 r ■<— row index of first (subdiagonal) entry of maximum magnitude in column 1 

7 uir maximum magnitude of any off-diagonal entry in column r 

8 if \ai\\ujr > awi then 

9 Use flu as a 1 X 1 pivot (s = 1) 

10 else if \arr\ > Q-ijJr then 

11 Use Urr as a 1 X 1 pivot (s = 1, swap rows and columns 1, r) 

12 else 

13 Use ( J as a 2 X 2 pivot (s = 2, swap rows and columns 2, r) 

14 end 

15 end 


The constant a = (1 -I- •\/l7)/8 in line 1 of the algorithm controls the growth factor, and 
ttij is the ij-th entry of the matrix A after computing all the delayed updates in Algorithm[^ 
on column i. Although the partial pivoting strategy is backward stable [Higham 2002] , the 
possibly large elements in the unit lower triangular matrix L may cause numerical difficulty. 
Rook pivoting provides an alternative that in practice proves to be more stable, at a modest 
additional cost. This procedure is presented in Algorithm The algorithm searches the 
pivots of the matrix in spiral order until it finds an element that is largest in absolute value 
in both its row and its column, or terminates if it finds a relatively large diagonal element. 
Although theoretically rook pivoting could traverse many columns, we have found that it 
is as fast as Bunch-Kaufman in practice, and we use it as the default pivoting scheme of 
SYM-ILDL. 

2.3. Equilibration and reordering strategies 

In many cases of practical interest, the input matrix is ill-conditioned. For these cases, 
equilibration schemes have been shown to be effective in lowering the condition number of 
the matrix. Symmetric equilibration schemes rescale entries of the matrix by computing a 
diagonal matrix D such that DAD has equal row norms and column norms. 


SYM-ILDL offers three equilibration schemes. Two of the equilibration schemes are built- 
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ALGORITHM 4: LDL"*" using rook pivoting strategy 

1 (1 +VT7)/8(«0.64) 

2 uji ■(— maximum magnitude of any subdiagonal entry in column 1 

3 if I an I > auj\ then 

4 Use an as a 1 x 1 pivot (s = 1) 

5 else 

6 i <— 1 

7 while a pivot is not yet chosen do 

8 r <r- row index of first (subdiagonal) entry of maximum magnitude in column i 

9 LUr <— maximum magnitude of any off-diagonal entry in column r 

10 if \arr\ > then 

11 Use Orr as a 1 X 1 pivot (s = 1, swap rows and columns 1 and r) 

12 else if uji = Ur then 

13 Use I 1 as a 2 X 2 pivot (s = 2, swap rows and columns 1 and i, and 2 and r) 

yUri OrrJ n \ , n ! 

14 else 

15 i r 

16 LOi iJr 

17 end 

18 end 

19 end 


2.3.1. Bunch's equilibration. Bunch’s equilibration allows the user to scale the max norm of 
every row and column to 1 before factorization. Let T be the lower triangular part of A 
in absolute value (diagonal included), that is, = \Aij\, 1 < j < i < n. Then Bunch’s 
algorithm runs in 0(nnz(A)) time, and is based on the following greedy procedure: 

For 1 < i < u, set 


D,i := 


max. <\/Tii, max 


2.3.2. Ruiz's equilibration. Ruiz’s equilibration allows the user to scale every row and co lumn 
of the matrix to 1 in any Lp norm, provided that p > 1 and the matrix has support Ruiz 


2001|. For the max norm, Ruiz’s algorithm scales each column’s norm to within e of 1 in 


0{nnz{A) log j) time for any given tolerance e. We use a variant of Ruiz’s algorithm that 
is similar in spirit, but produces different scaling factors. 

Let r(A, i) and c{A, i) denote the i -th row and column of A respectively, and let D(i, a) 
to be the diagonal matrix with Djj = 1 for all j ^ i and Du = a. Using this notation, our 
variant of Ruiz’s algorithm is shown in Algorithm]^ 

Our presentation differs from Ruiz’s original algorithm in that it operates on one row 
and column at a time as opposed to operating on the entire matrix in each iteration. We 
implemented the algorithm this way as it naturally adapts to our storage structures; our code 
is more easily amenable to single column operations rather than matrix-vector products. 
Hence Ruiz’s original implementation and ours produce quite different scaling matrices. 
However, a proof of correctness similar to that of of Ruiz’s algorithm applies, with the same 
guarantee for the running time. 

2.3.3. Matching-based equilibration. The use of weighted matchings provides an effective tech¬ 
nique to improve the stability of computing factorizations. In many cases, reorderings based 
on weighted matchin gs provided an effective form of static pivoting for tough indefinite 
symmetric problems Hagemann and Schenk 2006 . Our code provides a wrapper to the 


well-known HSL_MC64 software package, which implements a matching-based equilibration 
algorithm. When MC64 is installed, our code will use a symmetrized variant of it to gener- 
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ALGORITHM 5: Equilibrating general matrices in the max-norm 


1 

2 

3 

4 

5 

6 

r 

8 

9 

10 

11 


Input: A general matrix A 

Output: Diagonal matrices R and C such that RAC has max-norm 1 in every row and column 

R^l 

C 

A 

while R and C have not yet converged do 
for i := 1 to n do 

OLr ^ / 1 

\/lk(A,i)||oo 
Oc •<- , 1 

VlWXiilU 
R R ■ D{i, Ur) 

C ^ C ■ D{i,aJ) 

A ■(— D{i, ar)AD(i, aA 

end 


12 end 


ate a scaling matrix that scales the max norm of ever; 
regarding the functionality of MC64 can be found in [Duff and Koster 2001 
manual of the HSL Mathematical Software Library. 


row and column to 1. More details 
?1 and in the 


2.3.4. Comparison of equilibration strategies. Ruiz’s strategy seems to perform well in terms 
of preserving diagonal dominance when no reordering strategy is used. In fact, we have ob¬ 
served that for certain skew-symmetric systems, Ruiz’s equilibration leads to convergence 
of the iterative solver, while Bunch’s approach does not. On the other hand, Bunch’s equi¬ 
libration strategy is faster, being a one-pass procedure. When MC64 is available, its speed 
and scaling are comparable with Bunch’s algorithm. However there are some matrices in 
our test suite for which MC64 provides a suboptimal equilibration. In our experiments we 
use Bunch’s algorithm as the default. 


2.3.5. Fill-reducing reorderings. After equilibration, we carry out a reordering strategy. The 
user is given the option of choosing from Approximate Minimum Degree (AMD) [Amestoy 


et al. 19^, Reverse Cuthill-McKee (ROM) [George and Liu 1981 , and MC64. AMD and 


ROM are built into SYM-ILDL, but MC64 requires the installation of an external library. 
Whereas RCM and AMD are meant to reduce HU, MC64 computes a symmetric reordering 
of the matrix so that larger elements are placed near the diagonal. This has the effect of 
improving stability during the factorization, but may increase fill. Though there are matrices 
in our tests for which MC64 reordering is effective, we have found reducing fill to be more 
important, as our pivoting procedures deal with stability issues already. For the purpose 
of improving diagonal dominance while reducing fill, a common strategy is to combine 
MC64 with a fill-reducing reo rdering such as AMD or METIS [Schenk a nd Gartner 200^ 
Hagemann and Schenk 2006 . We use the procedure described in Hagemann and Schenk| 


2006 , which first preprocesses the matrix with MC64 and then compresses 2x2 pivots 


identified during the matching. The rows/columns corresponding to the pivots have their 
zero patterns merged and is replaced by a single row/column. Then AMD is run on the 
condensed matrix, after which the pivots are expanded back into two rows and columns. 
This procedure is implemented in HSL_MC80. Although MC80 is not built into our package, 
the results obtained by MC80 are comparable to those obtained by AMD and MG64. These 
results can be found in the appendix. For reducing fill, we have found both MG80 and AMD 
to be effective for our test cases. As MC80 requires an external library, AMD is set as the 
default in the code. 
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2.4. LDL and ILDL factorizations for skew-symmetric matrices 

The real skew-symmetric case is different than the symmetric indefinite case in the sense 
that here, we must always use 2x2 pivots, because diagonal elements of real skew-symmetric 
matrices are zero. This simplifies both the Bunch-Kaufman and the Rook pivoting proce¬ 
dure: we have only one case in both scenarios. Algorithm illustrates the simplification 
for rook pivoting (the simplification for Bunch is similar). Furthermore, as opposed to a 
typical 2x2 symmetric matrix, which is defined by three parameters, the analogous real 
skew-symmetric matrix is defined by one parameter only. As a result, at the /cth step, the 
computation of the multiplier and the subsequent update of pair of columns associated with 
the pivoting operation can be expressed as follows: 


which can be trivially computed by swapping columns k and fc -|- 1 and scaling. 


— -^k+2:n,k:k+l 
1 


O-k+lM 


-A, 


0 —ak+i,k 

^k-\-l,k 0 

0 1 


k-\-2:n,k:k-\-l 1 _^ q 


ALGORITHM 6: LDL"*" using rook pivoting strategy for skew-symmetric matrices 

1 cui maximum magnitude of any subdiagonal entry in column 1 

2 i •<— 1 

3 while a pivot is not yet chosen do 

4 r row index of first (subdiagonal) entry of maximum magnitude in column i 

s oJr maximum magnitude of any off-diagonal entry in column r 

6 if cui = Ur then 

7 Use ( I as a 2 X 2 pivot (swap rows and columns 1 and i, and 2 and r) 

XCLri vJ J 

8 else 

9 i r 

10 Ui Ur 

11 end 

12 end 


The ILDL factorization for skew-symmetric matrices can thus be carried out similarly 
to the manner in which it is developed for symmetric indefinite matrices, but the eventual 
algorithm gives rise to the above described simplifications. Skew-symmetric matrices are 
often ill-conditioned, and we have experimentally found that computing a numerical solution 
effectively for those systems is challenging. More details are provided in Section 

3. ITERATIVE SOLVERS FOR SYM-ILDL 

In SYM-ILDL we implement two preconditioned iterative solvers: SQMR and MINRES. 
For SQMR, we can use the incomplete LDL^ factorization as a preconditioner directly. For 
MINRES, we require the preconditioner to be positive definite, and modify our LDL"’" as 
described in Section [3l 

In our experiments, SQMR usually took fewer iterations to converge, with the same 
arithmetic cost per iteration. However, we found MINRES to be useful and more stable 
in difficult problems where SQMR stagnates (see Section]^. In these problems, MINRES 
returns a solution vector with a much smaller residual than SQMR in fewer iterations. 
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3.1. A specialized preconditioner for MINRES 

We describe below techniques for generating MINRES preconditioned iterations, using pos¬ 
itive definite versions of the incom plete factorizatio n. For the symmetric indefinite case, 


we apply the method presented in Gill et al. 1992 . Given M = LDL?' ^ let us focus our 


attention on the various options for the blocks of Zl. Our ultimate goal is to modify D and L 
such that D is diagonal with only 1 or —1 as its diagonal entries. If a block of the matrix D 
from the original LDL factorization was 2x2, then the corresponding modified (diagonal) 
block would become 

±1 0 
0 Tl 


For a diagonal entry of D that appears as a 1 x 1 block, say, di^i, we rescale the ith row 
of L: L{i ,:) —)• L{i, :)^/\di^i\. We can then set the new value of di^i as sgn((ii_i). In practice 
there is no need to perform a multiplication of a row of L by y/\di^] instead, this scalar is 
stored separately and its multiplicative effect is computed as an 0(1) operation for every 
matrix vector product. 

Now, consider a 2 x 2 block of H, say Dj. For this case, we compute the eigendecomposition 

~ Qj^jQj I 


and similarly to the case of a 1 x 1 block, we implicitly rescale two rows of L by QjyJ\K^\. 
This means that L is no longer triangular; it is in fact lower Hessenberg, since some values 
above the main diagonal may become nonzero. But the solve is just as straightforward, since 
the decomposition is explicitly given. 

Since L was originally a unit lower triangular matrix that was scaled by positive scalars, 
is symmetric positive definite and we use it as a preconditioner for MINRES. Note that 
if we were to compute the full LDL"'" decomposition and scale L as described above, then 
MINERS would converge within two iterations (in the absence of roundoff errors), thanks 
to the two eigenvalues of D, namely 1 and —1. 

In the sk ew-symmetric case, we may use a specialized version of MINRES 
Varah 2009 . We only have 2x2 blocks, and for those, we know that 


Greif and 



^3,3 1 



0 ) 

1 




'3.3 \ 


0 


0 


'0,3 


0 ±I 
Tl 0 




'3,0 I 




0 


'3,0 


Therefore, we do not need an eigendecomposition (as in the symmetric case), and instead 
we just scale the two affected rows of L by ^J\ajj\l 2 - 

Figure shows the clustering effect that the proposed preconditioning approach has. We 
generate a random real symmetric 300 x 300 matrix A (with entries drawn from the standard 
normal distribution), and compute the eigenvalues of {LDLA')~'A, where L and D are the 
matrices generated in the above described preconditioning procedure. Our fill factor is 2.0 
and the drop tolerance was 10“"^. We note that the eigenvalue distribution in the figure is 
typical for other cases that were tested. 


4. IMPLEMENTATION 

4.1. Matrix storage in SYM-ILDL 

Since we are dealing with symmetric or skew-symmetric matrices, one of our goals is to avoid 
duplicating data. At the same time, it is necessary for SYM-ILDL to have fast column 
access as well as fast row access. In terms of storage, we deal with these requirements 
by generating a format similar to standard compressed sparse column form, along with 
compressed sparse row form without the nonzero floating point matrix values. Matrices are 
stored in a list-of-arrays format. Each column is represented internally as two dynamically 
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Fig. 1: Eigenvalues of a preconditioned symmetric random 300 x 300 matrix (red). Note 
the clustering at 1 and -1. Entries in the matrix are drawn from the standard normal 
distribution. The unpreconditioned eigenvalues are in blue. 


sized arrays, storing both its nonzero valnes col_val and row indices (col_list). These 
arrays facilitate fast random accesses as well as removals from the middle of the array (by 
simply swapping the value to be deleted to the end of the array, and decrementing its size 
by 1). Meanwhile, another array holds pairs of pointers to the two column arrays of each 
column. One advantage of this format is that swapping columns and deallocating their 
memory is much easier, as we only need to operate on the array holding column pointers. 
Additionally, a row-major data structure (row_list) is used to maintain fast access across 
the nonzeros of each row (see Figure]^. This is obtained by representing each row internally 
as a single array, storing the column indices of each row in an array (the nonzero values are 
already stored in the column-major representation). 

Our format i s an improvement over storing the full matrix in standard CSC, as used in [Li| 
and Saad 2005|. Assuming that the row and column indices are stored in 32-bit integers and 
the nonzero values are stored in 64-bit doubles, this gives us an overall 33% saving in storage 
if we were to store the factorization in-place. This is an easy modihcation of Algorithm 
In the default implementation, we find it more useful to store an equilibrated and permnted 
copy of the original matrix, so that we may use it for MINRES after the preconditioner is 
computed. An in-place version that returns only the preconditioner is included as part of 
our package. 


4.2. Data structures for matrix access 

In ILUC [Li and Saad 2005), a bi-index data structure was developed to ad dress two imple¬ 


ment ation ditficulties in sparse matrix operations, following earlier w ork by Eisenstat et al. 

and I Jones and Plassmann 


1981 


in the context of the Yale Sparse Matrix Package (YSMP) 

, Our implementation uses a similar bi-index data structure, which we briefly describe 
below. 

Internally, the column and row indices in the matrix are stored in partial order, with one 
array per column a nd row. On the fc-th iteration, elements are partially sorted so that all row 
indices less than k are stored in a contiguous segment per column, and all row indices greater 
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list: 

col_ 

.list: 

col. 

.val: 

[0] 

0: 

[0,1,2,4] 

0: 

m,m, 

[0,1] 

1: 

[1] 

1: 

[ ] 

[0,2] 

2: 

[2,4] 

2: 

»,■] 

[] 

3: 

[] 

3: 

[] 

[0,2,4] 

4: 

[4] 

4: 

[ ] 


col_first: 
row first: 


[2,null,0,null,0] 
[0,1,1,null,1] 


M 


Fig. 2: Graphical representation of the data structures of SYM-ILDL. col_first and 
row_first are shown during the third iteration of the factorization. Hence col_first 
holds the values of indices in col_list for the first element under or on the third row of the 
matrix. Similarly, row_first holds the values of indices in row_list for the last element 
not exceeding the third column of the matrix. 


or equal to k are stored in another contiguous segment. Within each segment, the elements 
are unsorted. This avoids the cost of sorting whenever we need to pivot. Since elements 
are partially sorted, accessing specific elements of the matrix is difficult and requires a slow 
linear search. Luckily, because Algorithm accesses elements in a predictable fashion, we 
can speed up access to subcolumns required during the factorization t o 0(1) amortized time. 
The s trategy we use to speed up matrix access is similar to that of [Jones and Plassmann 
1995 . To ensure fast access to the submatrix Lk+i-n,i:k and the row L/jd during factorization, 
we use one additional length n array: col_first. On the k-th iteration, the i-th element 
of col_first holds an offset that stores the dividing index between indices less than k and 
greater or equal to k. In effect, col_f irst gives us fast access to the start of the submatrix 
Lk+i:n,i in col_list and speeds up Algorithm]^ allowing us access to the submatrix in 
0(1) time. To get fast access to the list of columns that contribute to t he u pdate of the 
(fc+ l)-st column, we use the row structure row_list discussed in Section 4.1 To speed up 
access to row_list, we maintain a row_first array that is implemented similarly to the 
col_first. Overall, this reduces the access time of the submatrix Lk+i-.n,i:k and row Lk 
down to a cost proportional to the number of nonzeros in these submatrices. 

Before the first iteration, col_first(i) is initialized to an array of all zeros. To ensure 
that col_first(i) stores the correct offset to the start of the subcolumn Lk+i-.n,i on step 
k, we increment the offset for col_first(i) (if needed) at the end of processing the k- 
th column. Since the column indices in col_list are unsorted, this step requires a linear 
search to find the smallest element in col_list. Once this element is found, we swap it 
to the correct spot so that the column indices for Lk+i:n,i are in a contiguous segment of 
memory. We have found it helpful to speed up the linear search by ensuring the indices of 
A are sorted before beginning the factorization. This has the effect that A remains roughly 
sorted when there are few pivot steps. 

Similarly, we will also need to access the subrows A/, i-k and A^^i-k during the pivot¬ 
ing stage (lines 11 to 15 in Algorithm and Algorithm Q. This is sped up by an analo¬ 
gous row_first(i) structure that stores offsets to the end of the subrow Ai^i.,k {Ai^i-^ is 
the memory region that encompases everything from the start of memory for that row to 
row_f irst (i)). At the end of step k, we also increment the offsets for row_first if needed. 

A summary of data structures can be found in Table |l| 
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Table I: Variable names with data structure types 


Variable name 

Data structure type 

Purpose 

col_first 

n length array 

Speeds up access to ii i-®., row_list 

row_first 

n length array 

Speeds up access to Ai i-fc, i.e., col_list 

row_list 

n dynamic arrays (row-major) 

Stores indices of A across the rows 

col_list 

n dynamic arrays (col-major) 

Stores indices of A across the columns 

col_val 

n dynamic arrays (col-major) 

Stores nonzero coefficients of A 


5. NUMERICAL EXPERIMENTS 


All our experiments were run single threaded, on a machine with a 2.1 GHz Int el Xeon CPU 
and 128 GB of RAM . In the experiments below, we follow the conventions of |Li and Saad 


2005| |Li et al. 2003 and define the fill of a factorization as nnz(L + D + L^)/nnz(A). Recall 
that the fill_factor is the maximum allowed ratio between the number of nonzeros in 
any column of L and the average number of nonzeros per column of A. Therefore, the fill 
of our preconditioner is bounded by approximately 2 • fill_f actor; the factor of 2 arises 
from the symmetry. 


5.1. Results for symmetric matrices 

5.1.1. Tests on general symm etric indefinite matrices. For testing our code, we use the University 
of Florida (UF) collection [Davis and Hu 201 1|, as well as our own matrices. The UF 


collection provides a variety of symmetric matrices, which we specify in Tables [III and j VHI 


We have used some of the same matrices th at have been used in the papers |Li and Saad 


IE 


2005 Li et al. 2003 Scott and Tuma 2014a 


Table |ll| we show the results of experiments with a set of matrices from [Davis and 


Hu 2011 as well as comparisons with MATLAB’s ILUTP. The matrix dimensions go up to 


approximately four million, with number of nonzeros going up to approximately 100 million. 
We show timings for constructing the ILDL factorization and an iterative solution, applying 
preconditioned SQMR for SYM-ILDL and GMRES(20) for ILUTP with drop tolerance 10“^ 
for a maximum of 1000 iterations. We apply either Bunch’s equilibration or MG64 scaling 
and either AMD or MC64 reordering (MC64R) before generating the ILDL factorization. 
Preconditioned SQMR is run with SYM-ILDL for a maximum of 1000 iterations. We show 
the best results in Table [H] out of the 4 possible reordering and equilibration combinations 
for both ILUTP and ILDL. We have also tested the ILDL preconditioner with HSL_MC80 
reordering and equilibration and have found it to be comparable with the best of the 4 
combinations above. The full test data for all 4 combinations as well as tests with MG80 
can be found in Table jVIj of the appendix. For the incomplete factorization, we apply 
rook pivoting. We observe that ILDL achieves similar iteration counts with a far sparser 
preconditioner. Furthermore, even for cases where ILDL was beaten on iteration count, we 
see that the denser factor of ILUTP causes the overall solve time to be much slower than 
ILDL. When ILDL and ILUTP have similar fill, ILDL converges in fewer iterations. 

In Figure]^ we examine the sensitivity of ILDL to input tolerance. We plot the number 
of iterations and the timings for a changing value of the tolerance. We observe the expected 
behavior. As the tolerance decreases, there is a tradeoff between preconditioning time and 
iteration count. Thus, the total computational time is high at both extremes. That said. 
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there is a large range of values of tolerance for which both time and iterations are modest. 
Altogether, ILDL works well for all test cases with fairly generic parameters. 



tolerance 


Fig. 3: The number of SQMR iterations and total precondition+solve time as a function 
of tolerance. Tests were performed on the tumal matrix with Bunch equilibration, AMD 
reordering, and actor set to oo (to measure only the tolerance dropping rule). 


5.1.2. Further comparisons with Matlab's ILUTP. To show the memory efficiency of our code, 
we consider matrices associated with the discrete Helmholtz equation, 

—Au — au = f, (1) 

subject to Dirichlet boundary conditions on the unit square, discretized using a uniform 
mesh of size h. Here we choose a moderate value of a, so that a symmetric indefinite matrix 
is generated. The choice of a may have a significant impact on the conditioning of the 
matrix. In particular, if a is an eigenvalue, then the shifted matrix is singular. In SYM- 
ILDL, a singular matrix will trigger static pivoting, and may add a significant computational 
overhead. In our numerical experiments we have stayed away from choices of a such that 
the shifted matrix is singular. Although we could have used the same matrices as Table [III 
additional tests using Helmholtz matrices provide a greater degree of insight as we know its 
spectra and can easily control the dimension and number of non-zeros. 

In Table |Tn| we present results for the Helmholtz model problem. We compare SYM-ILDL 
to Matlab’s ILUTP. For ILUTP we used a drop tolerance of 10“^ in all test cases. For 
ILDL, the f ill_factor was set to oo (since ILUTP does not limit its intermediate memory 
by a fill factor) while the drop_tol parameter was then chosen to get roughly the same 
fill as that of ILUTP. In the context of the ILUTP preconditioner, the fill is defined as 
nnz(L -I- ?7)/nnz(A). 
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Table II: Factorization timings and SQMR iterations for test matrices 


matrix 

n 

nnz(A) 

fill 

time (s) 

type 

iterations 

augSdcqp 

35543 

128115 

1.9 

0.05 + 0.15 

ILDL(B+AMD) 

24 




7.3 

2.66+0.20 

ILUTP(B+AMD) 

6 

bloweya 

30004 

150009 

1.0 

0.07 + 0.02 

ILDL(MC64+MC64R) 

3 




3.2 

7.86+0.10 

ILUTP(B+MC64R) 

3 

bratuSd 

27792 

173796 

3.8 

0.25 + 0.11 

ILDL(B+MC64R) 

18 




8.1 

8.50+0.54 

ILUTP (B+MC64R) 

11 

tumal 

22967 

87760 

3.0 

0.05 + 0.13 

ILDL(MC64+MC64R) 

35 




7.8 

2.68+0.58 

ILUTP(B+AMD) 

14 

tuma2 

12992 

49365 

3.0 

0.03 + 0.09 

ILDL(MC64+MC64R) 

28 




6.9 

0.72+0.23 

ILUTP(B+AMD) 

13 

boydl 

93279 

1211231 

1.0 

0.10 + 0.50 

ILDL(B+AMD) 

3 




0.8 

0.26+0.86 

ILUTP(B+MC64R) 

10 

brainpc2 

27607 

179395 

1.0 

0.31 + 0.10 

ILDL(MC64+MC64R) 

31 




0.6 

0.54+38.7 

ILUTP(B+AMD) 

NC 

marioOOl 

38434 

204912 

3.7 

0.13 + 0.56 

ILDL(B+MC64R) 

52 




8.0 

2.47+0.54 

ILUTP(B+AMD) 

8 

qpband 

20000 

45000 

1.1 

0.008 + 0.004 

ILDL(B+AMD) 

1 




1.1 

0.008+0.021 

ILUTP (B+AMD) 

1 

nlpkktSO 

1062400 

28192672 

8.0 

153 + 53 

ILDL(B+MC64R) 

34 




4.1 

6803+2502 

ILUTP(B+AMD) 

NC 

nlpkktl20 

3542400 

95117792 

8.0 

525 + 334 

ILDL(B+MC64R) 

58 




- 

- 

ILUTP 

- 


The experiments were run with fill_factor = 2.0 for the smaller matrices and fill_factor = 4.0 
for matrices larger than one million in dimension. The tolerance was drop_tol = 10“'^, and we used 
rook pivoting to maintain stability. The iteration was terminated when the norm of the relative 
residual went below 10“^. The time is reported as x + y, where x is the preconditioning time and y 
is the iterative solver time. Times labelled with took over 10 hours to run, and were terminated 
before completion. Iteration counts labelled with NC indicates that the problem did not converge 
within 1000 iterations. 


For both ILDL and ILUTP, GMRES(IOO) was used as the iterative solver and the input 
matrix was scaled with Bunch equilibration and reordered with AMD. During the computa¬ 
tion of the preconditioner, the in-place version of ILDL uses only about 2/3 of the memory 
used by ILUTP. During the GMRES solve, the ILDL preconditioner only uses about 1/2 
the memory used by ILUTP. We note that ILDL could also be used with SQMR, which has 
a much smaller memory footprint than GMRES. 

We observe that the performance of ILDL on the Helmholtz model problem is dependent 
on the value of a chosen, but that if ILDL is given the same memory resources as ILUTP, 
ILDL outperforms ILUTP. The memory usage of ILUTP and ILDL are measured through 
the MATLAB profiler. For a = 0.3//i^, the ILDL approach leads to lower iteration counts 
even when approximately 1/2 of the memory is allocated (i.e., when the same fill is allowed), 
whereas for a = Q.ljh?^ ILUTP outperforms ILDL when the fill is roughly the same. If we 
allow ILDL to have memory usage as large as ILUTP (i.e., up to roughly 3/2 the fill), we 
see that ILDL clearly has lower iteration counts for GMRES. 


5.1.3. 


Comparison s with HSL_MI30. In Table VIII we compare our code to the code of [Scott 


and Tuma 2014a 
done in IScott and Tuma 2014a 


implemented in the package HSL_MI30. This comparison was already 
with an older version of SYM-ILDL. However 


with re¬ 
cent improvements, we see that SYM-ILDL generally takes 2-6 times fewer iterations than 
HSL_MI30. The matrices we compare with are a subset of the matrices used in the original 
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Table III: Comparison of Matlab’s ILUTP and SYM-ILDL for Helmholtz matrices 


matrix 

n 

nnz(A) 

ilu fill 

ilu gmres iters 

ildl fill 

ildl gmres iters 

a = 0.3/h^, Extra memory for ILUTP 

helmholtzSO 

6400 

31680 

7.7 

11 

7.6 

8 

helmholtzl20 

14400 

71520 

10.6 

13 

10.3 

8 

helmholtzlbO 

25600 

127360 

12.3 

17 

12.3 

8 

helmholtz200 

40000 

199200 

14.1 

24 

14.0 

11 

a = O.Tjh^, Extra memory for ILUTP 

helmholtzSO 

6400 

31680 

7.4 

5 

7.5 

8 

helmholtzl20 

14400 

71520 

13.4 

10 

14.0 

18 

helmholtzlbO 

25600 

127360 

16.4 

15 

16.7 

43 

helmholtz200 

40000 

199200 

19.5 

20 

20.8 

86 

a = 0.7!, Equal memory for ILDL and ILUTP 

helmholtzSO 

6400 

31680 

7.4 

5 

11.0 

6 

helmholtzl20 

14400 

71520 

13.4 

10 

18.6 

6 

helmholtzlbO 

25600 

127360 

16.4 

15 

22.8 

8 

helmholtz200 

40000 

199200 

19.5 

20 

33.0 

11 


The parameter a in Equation]^ is indicated above. GMRES was terminated when the relative residual decreased below 10 


comparison. In particular, these matrices were ones for which SYM-ILDL performed the 
most poorly in the origin al comparison. The m atrices were obtained from the University of 
Florida matrix collection Davis and Hu 2011 . 

The parameters used here are almost the same as in the original comparison. For 
HSL_MI30, we used the built-in MATLAB interface, set Isize and rsize to 30, a(l : 2) 
both to 0.1, the drop tolerances ti and T 2 to 10“^ and 10“^, and used the built-in MC77 
equilibration (which performed the best out of all possible equilibration options, including 
MC64). We also tried all possible reordering options for HSL_MI30 and found that the natu¬ 
ral ordering performed the best. For SYM-ILDL, we used a f ill_factor of 12.0, drop_tol 
of 0.003, as in the original comparison. The only difference between the original comparison 
and this one is that rook pivoting is used for stability and MC80 is used for equilibration 
and reordering. We have also performed additional tests using MC64 for equilibration and 
AMD for reordering and have found comparable number of iterations with higher fill. All 
tests can be found in the appendix. 


Table IV: GMRES comparisons between SYM-ILDL and HSL_MI30 


Matrix name 


nnz(A) 1 

fillMISO 

MI30 iters 

time (s) 

fillSYM-ILDL 

SYM-ILDL iters 

time (s) 

c-55 

32780 

403450 

3.45 

49 

1.25-1-0.94 

2.95 

15 

0.23+0.15 

c-59 

41282 

480536 

3.62 

70 

1.59-1-1.84 

2.99 

15 

0.36+0.20 

c-63 

44234 

434704 

4.10 

51 

1.53-1-1.23 

2.92 

15 

0.29+0.21 

c-68 

64810 

565996 

4.12 

37 

1.87-1-1.12 

2.31 

9 

0.31+0.17 

c-69 

67458 

623914 

4.33 

43 

4.07-1-1.47 

2.65 

9 

0.35+0.18 

c-70 

68924 

658986 

4.26 

38 

3.77-1-1.30 

2.67 

11 

0.40+0.24 

c-71 

76638 

859520 

3.58 

61 

3.93-1-2.71 

3.00 

12 

0.74+0.32 

c-72 

84064 

707546 

4.18 

54 

3.05-1-2.40 

2.69 

9 

0.40+0.31 

c-big 

345241 

2340859 

4.82 

67 

23.4-1-25.3 

2.54 

8 

1.20+0.93 


For each test case, we report the time it takes to compute the preconditioner, as well as the GMRES time and the number of 
GMRES iterations. The time is reported as x + y, where x is the preconditioning time and y is the GMRES time. GMRES 
was terminated when the relative residual decreased below 10~®. 























15 


We note that although we set the fill_factor to be 12.0 in all comparisons with 
HSL_MI30, SYM-ILDL can have similar performance with a much smaller fill_factor. 

5.2. Results for skew-symmetric matrices 

We test with a skew-symmetrized version of a model convection-diffusion equation, which 
is a discrete version of 


-Au+ {a,T,fi)\7u = f, (2) 

with Dirichlet boundary conditions on the unit square, discretized using a uniform mesh of 
size h. We define the mesh Peclet numbers 

/3 = ahj2, 7 = Th/2, S = fj,h/2. 

We use the skew-symmetric part of this matrix (that is, given A, form for our 

skew-symmetric experiments. 

In our tests, we have found that equilibration has not been particularly effective. We 
speculate that this might have to do with a property related to block diagonal dominance 
that these matrices have for certain values of the convective coefficients. Specifically, the 
norm of the tridiagonal part of the matrix is significantly larger than the norm of the 
remaining part. Equilibration tends to adversely affect this property by scaling down entries 
near the diagonal, and as a result the performance of an iterative solver often degrades. We 
thus do not apply equilibration in our skew-symmetric solver. 

In Table [V] we manipulate the drop tolerance for ILDL, to obtain a fill nearly equal to 
that of ILUTP. For the latter we fix the drop tolerance at 0.001. This is done for the purpose 
of comparing the performance of the iterative solvers, when the memory requirements of 
ILUTP and ILDL are similar. Prior to preconditioning, we apply AMD as a fill-reducing 
reordering. We apply preconditioned GMRES(IOO) to solve the linear system, until either 
a residual of 10“® is reached, or until 1000 iterations are used. If the linear system fails 
to converge after 1000 iterations, we mark it as NC. We see that the iteration counts are 
significantly better for ILDL, especially when rook pivoting is used. Note that our ILDL 
still consumes only about 2/3 of the memory of ILUTP, due to the fact that the floating 
point entries of only half of the matrix are stored. 

In FigureHwe show the (complex) e^envalues of the preconditioned matrix {LDL'^)~^ A, 
where A is the skew-symmetric part ofl^with convective coefficients (/3, 7 , <5) = (0.4,0.5, 0.6), 
and LDL^ is the preconditioner generated by running SYM-ILDL with a drop tolerance of 
10 “^ and a fill-in parameter of 20 . 

For the purpose of comparison, we also show the unpreconditioned eigenvalues. As seen 
in the figure, most of the preconditioned eigenvalues are very strongly clustered around 1 , 
which indicates that a preconditioned iterative solver is expected to rapidly converge. The 
unpreconditioned eigenvalues are pure imaginary, and follow the formula 

2 (/3 cos (jirh) + 7 cos (kirh) + S cos {iirh)) i 


where I < j,k,£ < 1/h. 

6. OBTAINING AND CONTRIBUTING TO SYM-ILDL 

SYM-ILDL is open source, and documentation can be found at http://www.cs.ubc.ca/ 
~greif/code/sym-ildl.html. We essentially allow free use of our software with no restric¬ 
tions. To this end, SYM-ILDL uses the MIT Software License. 

We welcome any contributions to our code. Details on the contribution process can be 
found with the link above. Certainly, more code optimization is possible, such as paralleliza¬ 
tion; such tasks remain as items for future work. 
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Table V: Comparison of Matlab’s ILUTP and SYM-ILDL for a skew-symmetric matrix 
arising from a model convection-diffusion equation 


n 


nnz(A) 

method 

drop tol 

fill 

GMRES(20) 

time (s) 




ILDL-rook 

10 -^ 

7.008 

6 

0.130+0.041 

20^ 

= 8000 

45600 

ILDL-partial 

5 ■ 10-4 

6.861 

6 

0.138+0.041 




ILUTP 

CO 

1 , 

0 

1—1 

7.758 

8 

0.406+0.038 




ILDL-rook 

2 ■ lO”"' 

10.973 

8 

0.936+0.246 

303 

= 27000 

156600 

ILDL-partial 

CO 

0 

1 

11.235 

10 

1.162+0.331 




ILUTP 

10-3 

11.758 

13 

4.475+0.307 




ILDL-rook 

9■10-5 

15.205 

9 

3.820+0.855 

40® 

= 64000 

374400 

ILDL-partial 

CO 

0 

1 

15.686 

18 

4.971+1.582 




ILUTP 

10-3 

15.654 

19 

26.63+1.40 




ILDL-rook 

2■10-5 

21.560 

6 

15.39+1.76 

50^ 

= 125000 

735000 

ILDL-partial 

2 ■ 10-"^ 

22.028 

62 

21.11+17.95 




ILUTP 

CO 

1 

0 

1—1 

22.691 

58 

151.14+11.60 




ILDL-rook 

2■10-5 

22.595 

9 

34.82+4.02 

603 

= 216000 

1274400 

ILDL-partial 

4■10-^ 

22.899 

NC 

36.17+NC 




ILUTP 

10-3 

23.483 

NC 

356.60+NC 




ILDL-rook 

5■10-s 

32.963 

5 

106.81+3.25 

703 

= 343000 

2028600 

ILDL-partial 

4■10-4 

36.959 

NC 

156.52+NC 




ILUTP 

10-3 

33.861 

NC 

876.33+NC 


The parameter used were 0 = 20,7 = 2, <5 = 1. The Matlab ILUTP used a drop tolerance of 0.001. 
‘NC’ stands for ‘no convergence’. 




real(eig(A)) 


Fig. 4: Eigenvalues of an unpreconditioned (left) and preconditioned (right) skew-symmetric 
1000 X 1000 matrix A arising from a convection-diffusion model problem. 
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Table VI: Factorization timings and iterative solver iterations for test matrices 


matrix 

n 

nnz(^) 

fill 

time (s) 

type 

iterations 




1.9 

0.051+0.148 

ILDL(B+AMD) 

24 




3.3 

0.063+0.442 

ILDL(MC64+MC64R) 

55 

augSdcqp 

35543 

128115 

2.1 

0.068+0.261 

1LDL(MC64+AMD) 

33 




3.2 

0.063+0.223 

ILDL(B+MC64R) 

33 




7.3 

2.655+0.198 

ILUTP(B+AMD) 

6 




21.2 

11.674+0.890 

ILUTP(MC64+MC64R) 

14 




36.0 

11.513+0.397 

ILUTP(MC64+AMD) 

6 




7.4 

1.753+0.215 

ILUTP(B+MC64R) 

7 




0.9 

0.030+0.081 

ILDL(B+AMD) 

18 




1.0 

0.071+0.014 

ILDL(MC64+MC64R) 

3 

bloweya 

30004 

150009 

1.0 

0.023+0.019 

ILDL(MC64+AMD) 

5 




0.9 

0.152+0.126 

ILDL(B+MC64R) 

18 




2.8 

38.817+0.101 

ILUTP(B+AMD) 

4 




6.1 

2.726+0.109 

ILUTP(MC64+MC64R) 

4 




2.9 

39.537+0.104 

ILUTP(MC64+AMD) 

4 




3.2 

7.858+0.100 

ILUTP(B+MC64R) 

3 




3.8 

0.358+0.155 

ILDL(B+AMD) 

23 




3.6 

0.155+0.124 

1LDL(MC64+MC64R) 

24 

bratuSd 

27792 

173796 

3.6 

0.231+0.272 

1LDL(MC64+AMD) 

36 




3.8 

0.245+0.105 

ILDL(B+MC64R) 

18 




8.6 

22.237+0.214 

ILUTP(B+AMD) 

9 




10.3 

13.114+0.962 

ILUTP(MC64+MC64R) 

18 




9.8 

32.717+0.500 

ILUTP(MC64+AMD) 

10 




8.1 

8.480+0.540 

ILUTP(B+MC64R) 

11 




2.9 

0.044+0.201 

ILDL(B+AMD) 

50 




3.0 

0.051+0.132 

ILDL(MC64+MC64R) 

35 

tumal 

22967 

87760 

3.0 

0.077+0.299 

ILDL(MC64+AMD) 

54 




3.0 

0.046+0.220 

ILDL(B+MC64R) 

59 




7.8 

2.686+0.582 

ILUTP(B+AMD) 

14 




40.0 

19.476+0.495 

ILUTP(MC64+MC64R) 

8 




20.7 

7.268+0.242 

ILUTP(MC64+AMD) 

6 




17.7 

7.750+51.991 

ILUTP(B+MC64R) 

NC 




2.8 

0.023+0.084 

ILDL(B+AMD) 

41 




3.0 

0.029+0.087 

1LDL(MC64+MC64R) 

28 

tuma2 

12992 

49365 

3.0 

0.045+0.104 

ILDL(MC64+AMD) 

34 




3.0 

0.041+0.218 

ILDL(B+MC64R) 

55 




6.9 

0.720+0.226 

ILUTP(B+AMD) 

13 




33.8 

4.140+0.192 

ILUTP(MC64+MC64R) 

7 




19.0 

1.936+0.106 

ILUTP(MC64+AMD) 

5 




15.5 

2.082+12.341 

ILUTP(B+MC64R) 

697 



20 





1.0 

0.155+0.077 

ILDL(B+AMD) 

3 




0.6 

0.102+0.505 

ILDL(MC64+MC64R) 

42 

boydl 

93279 

1211231 

1.0 

0.123+0.088 

ILDL(MC64+AMD) 

6 




0.6 

0.144+0.437 

ILDL(B+MC64R) 

36 




0.8 

0.219+1.021 

ILUTP(B+AMD) 

10 




0.8 

0.257+0.875 

ILUTP(MC64+MC64R) 

12 




0.8 

0.233+1.656 

ILUTP(MC64+AMD) 

14 




0.8 

0.188+0.481 

ILUTP(B+MC64R) 

10 




1.0 

0.878+0.094 

ILDL(B+AMD) 

31 




1.8 

0.315+0.100 

ILDL(MC64+MC64R) 

31 

brainpc2 

27607 

179395 

1.5 

1.661+0.085 

ILDL(MC64+AMD) 

28 




1.8 

0.481+0.983 

ILDL(B+MC64R) 

214 




0.6 

0.541+38.711 

ILUTP{B+AMD) 

NC 




961.5 

373.210+1210.140 

ILUTP{MC64+MC64R) 

NC 




88.7 

15.434+180.070 

ILUTP(MC64+AMD) 

NC 




0.6 

0.925+38.263 

ILUTP(B+MC64R) 

NC 




3.7 

0.163+0.541 

ILDL(B+AMD) 

54 




3.6 

0.234+0.629 

ILDL(MC64+MC64R) 

55 

marioOOl 

38434 

204912 

3.6 

0.213+0.603 

ILDL(MC64+AMD) 

54 




3.7 

0.129+0.557 

ILDL(B+MC64R) 

52 




8.0 

2.474+0.542 

ILUTP{B+AMD) 

8 




9.3 

26.39+0.612 

ILUTP(MC64+MC64R) 

8 




9.0 

2.552+0.555 

ILUTP(MC64R+AMD) 

8 




8.6 

21.73+0.325 

ILUTP(B+MC64) 

9 




1.1 

0.008+0.004 

ILDL(B+AMD) 

1 




1.1 

0.007+0.004 

ILDL(MC64+MC64R) 

1 

qpband 

20000 

45000 

1.8 

0.014+0.004 

ILDL(MC64+AMD) 

1 




1.1 

0.016+0.004 

ILDL(B+MC64R) 

1 




1.1 

0.008+0.026 

ILUTP{B+AMD) 

1 




1.1 

0.008+0.021 

ILUTP(MC64+MC64R) 

1 




1.2 

0.011+0.028 

ILUTP(MC64+AMD) 

1 




1.1 

0.010+0.013 

ILUTP{B+MC64R) 

1 
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9.5 

113+1308 

ILDL(B+AMD) 

998* 


14.5 

176+1580 

ILDL(MC64+MC64R) 

854* 

nlpkktSO 

1062400 28192672 12.3 

153+53 

ILDL(MC64+AMD) 

34 


10.6 

121+NC 

ILDL(B+MC64R) 

NC 


4.1 

6803 + 2502 

ILUTP(B+AMD) 

NC 


- 

- 

ILUTP(MC64+MC64R) 

- 


- 

- 

ILUTP(MC64+AMD) 

- 


- 

- 

ILUTP(B+MC64) 

- 


9.8 

401+NC 

ILDL(B+AMD) 

NC 


14.5 

533+NC 

ILDL(MC64+MC64R) 

NC 

nlpkktl20 

3542400 95117792 12.4 

525+334 

ILDL(MC64+AMD) 

58 


10.9 

460+NC 

ILDL(B+MC64R) 

NC 


- 

- 

ILUTP(B+AMD) 

- 


- 

- 

ILUTP(MC64+MC64R) 

- 


- 

- 

ILUTP(MC64+AMD) 

- 


- 

- 

ILUTP(B+MC64) 

- 


The experiments were run with fill_factor = 2.0 for the smaller matrices and fill_factor = 
4.0 for matrices larger than one million in dimension. The tolerance was drop_tol = and 

we used rook pivoting to maintain stability. The iteration was terminated when the norm of the 
relative residual went below 10“^. For iteration counts labelled with a *, MINRES was used (as 
SQMR failed to converge). Iteration counts labelled with NC indicates non-convergence for both 
MINRES and SQMR. Times labelled with took over 10 hours to run, and were terminated 
before completion. 
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The table below uses HSL_MC80 on matrices from Table 1 and 3 in this paper. For MC80, 
we chose AMD as the fill-reducing reordering after the matching stage. Only matrices from 
these two tables were used as all other matrices in our tests were well-scaled and block- 
diagonally dominant to begin with (such as the Helmholtz problem of Table 2). 

Table VII: Results with HSL_MC80 for matrices in Table 1 and 3 


matrix 

n 

nnz(A) 

fill 

time (s) 

iterations 

aug3dcqp 

35543 

128115 

2.0 

0.051+0.188 

28 

bloweya 

30004 

150009 

0.9 

0.036+0.023 

4 

bratu3d 

27792 

173796 

2.9 

0.118+0.106 

26 

tumal 

22967 

87760 

3.0 

0.063+0.227 

44 

tuma2 

12992 

49365 

2.9 

0.033+0.094 

35 

boydl 

93279 

1211231 

1.0 

0.120+0.062 

4 

brainpc2 

27607 

179395 

1.5 

0.086+0.119 

26 

marioOOl 

38434 

204912 

3.6 

0.110+0.501 

59 

qpband 

20000 

45000 

1.1 

0.015+0.004 

1 

nlpkktSO 

1062400 

28192672 

7.1 

133-h86 

49 

nlpkktl20 

3542400 

95117792 

- 

x-hx 

- 

c-55 

32780 

403450 

2.95 

0.28+0.15 

15 

c-59 

41282 

480536 

2.99 

0.36+0.20 

15 

c-63 

44234 

434704 

2.92 

0.29+0.21 

15 

c-68 

64810 

565996 

2.31 

0.31+0.17 

9 

c-69 

67458 

623914 

2.65 

0.35+0.18 

9 

c-70 

68924 

658986 

2.67 

0.40+0.24 

11 

c-71 

76638 

859520 

3.00 

0.74+0.32 

12 

c-72 

84064 

707546 

2.69 

0.40+0.21 

9 

c-big 

345241 

2340859 

2.54 

1.2+0.93 

8 


Matrices in the first section (delimited by horizontal lines) were run with 
fill_factor = 2.0 and drop_tol = 10“'^. Matrices in the second section 
were run with fill_factor = 4.0 and drop_tol = 10““^. Matrices in 
the third section were run with fill_factor = 12.0 and drop_tol = 
0.003. Rook pivoting was used to maintain stability. The iterative solvers 
used for the first two sections was SQMR, and GMRES was used for the 
third section. These settings maintain consistency with Tables 1 and 3 of 
Section 5. The iteration was terminated when the norm of the relative 
residual went below 10~^. Iteration counts labelled with NC indicates 


non-convergence. 
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Table VIII: GMRES comparisons between SYM-ILDL and AMD with MC64 equilibration 


Matrix name 

n 

nnz(A) 

fillMISO 

MI30 iters 

time (s) 

fillSYM-ILDL 

SYM-ILDL iters 

time (s) 

c-55 

32780 

403450 

3.45 

49 

1.25+0.94 

3.85 

12 

0.49+0.15 

c-59 

41282 

480536 

3.62 

70 

1.59+1.84 

3.70 

13 

0.59+0.27 

c-63 

44234 

434704 

4.10 

51 

1.53+1.23 

4.12 

13 

0.48+0.25 

c-68 

64810 

565996 

4.12 

37 

1.87+1.12 

4.00 

9 

0.69+0.26 

c-69 

67458 

623914 

4.33 

43 

4.07+1.47 

3.93 

11 

0.64+0.34 

c-70 

68924 

658986 

4.26 

38 

3.77+1.30 

3.46 

13 

0.58+0.42 

c-71 

76638 

859520 

3.58 

61 

3.93+2.71 

4.09 

10 

1.13+0.40 

c-72 

84064 

707546 

4.18 

54 

3.05+2.40 

5.33 

14 

1.15+0.59 

c-big 

345241 

2340859 

4.82 

67 

23.4+25.3 

2.92 

11 

1.89+1.62 


For each test case, we report the time it takes to compute the preconditioner, as well as the GMRES time and the number of 
GMRES iterations. The time is reported as ai + y, where x is the preconditioning time and y is the GMRES time. GMRES 
was terminated when the relative residual decreased below 10“^. 



